Computation and Language 34
☆ Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi
Since the release of TÜLU [Wang et al., 2023b], open resources for
instruction tuning have developed quickly, from better base models to new
finetuning techniques. We test and incorporate a number of these advances into
TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing
the understanding and best practices of adapting pretrained language models to
downstream tasks and user preferences. Concretely, we release: (1)
TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2)
TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU
2 models trained with direct preference optimization (DPO), including the
largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE
LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its
instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple
perspectives shows that the TÜLU 2 suite achieves state-of-the-art
performance among open models and matches or exceeds the performance of
GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data,
training and evaluation code to facilitate future open efforts on adapting
large language models.
comment: technical report
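For readers unfamiliar with DPO, the objective trains the policy to prefer the chosen response over the rejected one relative to a frozen reference model. A minimal Python sketch of the per-pair loss; the log-probability values below are made up for illustration, not from any real model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    All arguments are summed log-probabilities of a response under the
    trainable policy (pi_*) or the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(sigmoid(margin))

# The loss shrinks as the policy prefers the chosen response more
# strongly than the reference model does.
loose = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # no preference learned yet
tight = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # chosen clearly preferred
print(loose, tight)
```

In real TÜLU 2+DPO training these log-probabilities come from full LLM forward passes over token sequences; the arithmetic of the loss is unchanged.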
☆ PEFT-MedAware: Large Language Model for Medical Awareness
Chat models are capable of answering a wide range of questions; however, the
accuracy of their responses is highly uncertain. In this research, we propose a
specialized PEFT-MedAware model where we utilize parameter-efficient
fine-tuning (PEFT) to enhance the Falcon-1b large language model on specialized
MedQuAD data consisting of 16,407 medical QA pairs, leveraging only 0.44% of
its trainable parameters to enhance computational efficiency. The paper adopts
data preprocessing and PEFT to optimize model performance, complemented by a
BitsAndBytesConfig for efficient transformer training. The resulting model
outperformed other LLMs on domain-specific medical question-answering tasks,
achieving greater accuracy with limited computational resources and making it
suitable for deployment in resource-constrained environments. We propose
further improvements through expanded datasets, larger
models, and feedback mechanisms for sustained medical relevancy. Our work
highlights the efficiency gains and specialized capabilities of PEFT in medical
AI, outpacing standard models in precision without extensive resource demands.
The proposed model and data are released for research purposes only.
comment: 7 pages, 1 figure, submitted to the Artificial Intelligence in
Medicine Journal
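Parameter-efficient fine-tuning in this setting typically means adapter-style methods such as LoRA: the pretrained weight stays frozen and only a small low-rank correction is trained. A numpy sketch of the underlying arithmetic; the dimensions and rank are illustrative choices, not Falcon-1b's, and the resulting trainable fraction is not meant to reproduce the paper's 0.44% figure:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 2048, 2048, 8           # illustrative sizes and rank
alpha = 16.0

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                 # B starts at zero: adapter is a no-op

def forward(x):
    # Base projection plus the scaled low-rank correction (alpha / r).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)    # adapter initially changes nothing

trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction: {trainable / total:.4%}")
```

Only A and B receive gradients, which is why the trainable-parameter fraction stays below one percent even for a single large projection matrix.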
☆ Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers AAAI24
This work presents an analysis of the effectiveness of using standard shallow
feed-forward networks to mimic the behavior of the attention mechanism in the
original Transformer model, a state-of-the-art architecture for
sequence-to-sequence tasks. We substitute key elements of the attention
mechanism in the Transformer with simple feed-forward networks, trained using
the original components via knowledge distillation. Our experiments, conducted
on the IWSLT2017 dataset, reveal the capacity of these "attentionless
Transformers" to rival the performance of the original architecture. Through
rigorous ablation studies, and experimenting with various replacement network
types and sizes, we offer insights that support the viability of our approach.
This not only sheds light on the adaptability of shallow feed-forward networks
in emulating attention mechanisms but also underscores their potential to
streamline complex architectures for sequence-to-sequence tasks.
comment: Accepted at AAAI24 (https://aaai.org/aaai-conference/)
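The replacement networks in this paper are shallow feed-forward nets trained by knowledge distillation on the attention module's outputs. The following numpy sketch is a much-simplified illustration of that setup: it fits a one-layer linear student in closed form instead of training an FFN by SGD, and the teacher weights, sequence length, and sizes are all toy choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_pairs = 16, 512

# A toy single-head self-attention "teacher" over length-4 sequences,
# flattened so a feed-forward student consumes whole sequences at once.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):                        # X: (seq_len, d)
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# Distillation data: (flattened input, flattened teacher output) pairs.
seqs = rng.standard_normal((n_pairs, 4, d))
X_flat = seqs.reshape(n_pairs, -1)
Y_flat = np.stack([attention(s) for s in seqs]).reshape(n_pairs, -1)

# "Train" the shallow student in closed form (one linear layer, no bias).
W_student, *_ = np.linalg.lstsq(X_flat, Y_flat, rcond=None)
mse = np.mean((X_flat @ W_student - Y_flat) ** 2)
baseline = np.mean(Y_flat ** 2)          # MSE of predicting all zeros
print(mse, baseline)
```

The student's distillation loss beating the trivial zero predictor is the weakest possible version of the paper's claim; the actual experiments use deeper FFN students and IWSLT2017 translation quality as the yardstick.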
☆ A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest
Ruohong Zhang, Luyu Gao, Chen Zheng, Zhen Fan, Guokun Lai, Zheng Zhang, Fangzhou Ai, Yiming Yang, Hongxia Yang
Large Language Models (LLMs), despite their great power in language
generation, often encounter challenges when dealing with intricate and
knowledge-demanding queries in specific domains. This paper introduces a novel
approach to enhancing LLMs by effectively extracting relevant knowledge from
domain-specific textual sources and adaptively training a chatbot on
domain-specific inquiries. Our two-step approach starts with training a
knowledge miner, namely LLMiner, which autonomously extracts Question-Answer
pairs from relevant documents through a chain-of-thought reasoning process.
Subsequently, we blend the mined QA pairs with a conversational dataset to
fine-tune the LLM as a chatbot, thereby enriching its domain-specific expertise
and conversational capabilities. We also developed a new evaluation benchmark
which comprises four domain-specific text corpora and associated human-crafted
QA pairs for testing. Our model shows a remarkable performance improvement
over generally aligned LLMs and surpasses domain-adapted models directly
fine-tuned on the domain corpus. In particular, LLMiner achieves this with
minimal human
intervention, requiring only 600 seed instances, thereby providing a pathway
towards self-improvement of LLMs through model-synthesized training data.
comment: Work in progress
☆ Hashing it Out: Predicting Unhealthy Conversations on Twitter
Personal attacks in the context of social media conversations often cause
fast-paced derailment, leading to even more harmful exchanges being made.
State-of-the-art systems for the detection of such conversational derailment
often make use of deep learning approaches for prediction purposes. In this
paper, we show that an Attention-based BERT architecture, pre-trained on a
large Twitter corpus and fine-tuned on our task, is efficient and effective in
making such predictions. This model shows clear performance advantages over
the existing LSTM model we use as a baseline. Additionally, we show that this
impressive performance can be attained through fine-tuning on a relatively
small, novel dataset, particularly after mitigating overfitting issues through
synthetic oversampling techniques. By introducing the first transformer-based
model for forecasting conversational events on Twitter, this work lays the
foundation for a practical tool to encourage better interactions on one of the
most ubiquitous social media platforms.
comment: 7 pages, 3 figures, academic
☆ Countering Misinformation via Emotional Response Generation EMNLP 2023
The proliferation of misinformation on social media platforms (SMPs) poses a
significant danger to public health, social cohesion and ultimately democracy.
Previous research has shown how social correction can be an effective way to
curb misinformation, by engaging directly in a constructive dialogue with users
who spread -- often in good faith -- misleading messages. Although professional
fact-checkers are crucial to debunking viral claims, they usually do not engage
in conversations on social media. Accordingly, significant effort has been made to
automate the use of fact-checker material in social correction; however, no
previous work has tried to integrate it with the style and pragmatics that are
commonly employed in social media communication. To fill this gap, we present
VerMouth, the first large-scale dataset comprising roughly 12 thousand
claim-response pairs (linked to debunking articles), accounting for both
SMP-style and basic emotions, two factors which have a significant role in
misinformation credibility and spreading. To collect this dataset we used a
technique based on an author-reviewer pipeline, which efficiently combines LLMs
and human annotators to obtain high-quality data. We also provide comprehensive
experiments showing how models trained on our proposed dataset have significant
improvements in terms of output quality and generalization capabilities.
comment: Accepted to EMNLP 2023 main conference
☆ Detection of Offensive and Threatening Online Content in a Low Resource Language
Hausa is a major Chadic language, spoken by over 100 million people in
Africa. However, from a computational linguistic perspective, it is considered
a low-resource language, with limited resources to support Natural Language
Processing (NLP) tasks. Online platforms often facilitate social interactions
that can lead to the use of offensive and threatening language, which can go
undetected due to the lack of detection systems designed for Hausa. This study
aimed to address this issue by (1) conducting two user studies (n=308) to
investigate cyberbullying-related issues, (2) collecting and annotating the
first set of offensive and threatening datasets to support relevant downstream
tasks in Hausa, (3) developing a detection system to flag offensive and
threatening content, and (4) evaluating the detection system and the efficacy
of the Google-based translation engine in detecting offensive and threatening
terms in Hausa. We found that offensive and threatening content is quite
common, particularly when discussing religion and politics. Our detection
system was able to detect more than 70% of offensive and threatening content,
although many of these were mistranslated by Google's translation engine. We
attribute this to the subtle relationship between offensive and threatening
content and idiomatic expressions in the Hausa language. We recommend that
diverse stakeholders participate in understanding local conventions and
demographics in order to develop a more effective detection system. These
insights are essential for implementing targeted moderation strategies to
create a safe and inclusive online environment.
comment: 25 pages, 5 figures, 8 tables
☆ When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content
Many under-resourced languages require high-quality datasets for specific
tasks such as offensive language detection, disinformation, or misinformation
identification. However, the intricacies of the content may have a detrimental
effect on the annotators. This article revisits an approach to
pseudo-labeling sensitive data, using the example of Ukrainian tweets covering
the Russian-Ukrainian war. This acute topic is currently the target of various
language manipulations that fuel widespread disinformation and profanity on
social media platforms. The conducted experiment highlights three main
stages of data annotation and underlines the main obstacles during machine
annotation. Ultimately, we provide a fundamental statistical analysis of the
obtained data, an evaluation of the models used for pseudo-labelling, and
further guidelines on how scientists can leverage the corpus to conduct more
advanced research and extend the existing data samples without annotator
engagement.
comment: Ukrainian language, pseudo-labelling, dataset, offensive-language
☆ CNL2ASP: converting controlled natural language sentences into ASP
Answer Set Programming (ASP) is a popular declarative programming language
for solving hard combinatorial problems. Although ASP has gained widespread
acceptance in academic and industrial contexts, there are certain user groups
who may find it more advantageous to employ a higher-level language that
closely resembles natural language when specifying ASP programs. In this paper,
we propose a novel tool, called CNL2ASP, for translating English sentences
expressed in a controlled natural language (CNL) form into ASP. In particular,
we first provide a definition of the type of sentences allowed by our CNL and
their translation as ASP rules, and then exemplify the usage of the CNL for the
specification of both synthetic and real-world combinatorial problems. Finally,
we report the results of an experimental analysis conducted on the real-world
problems to compare the performance of automatically generated encodings with
the ones written by ASP practitioners, showing that our tool can obtain
satisfactory performance on these benchmarks.
comment: Under consideration in Theory and Practice of Logic Programming
(TPLP)
☆ Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
Since their inception, embeddings have become a primary ingredient in many
flavours of Natural Language Processing (NLP) tasks, supplanting earlier types
of representation. Even though multilingual embeddings have been used for a
growing number of multilingual tasks, the scarcity of parallel training data
means that low-resource languages such as Sinhala tend to focus on monolingual
embeddings. When it comes to the aforementioned multilingual tasks, it is
challenging to utilize these monolingual embeddings: even if the embedding
spaces have a similar geometric arrangement due to an identical training
process, the embeddings of the languages considered are not aligned. This is
solved by the embedding alignment task. Even here, high-resource language
pairs are in the limelight, while low-resource languages such as Sinhala,
which are in dire need of help, have fallen by the wayside. In this paper, we
try to align Sinhala and English word embedding
spaces based on available alignment techniques and introduce a benchmark for
Sinhala language embedding alignment. In addition to that, to facilitate the
supervised alignment, as an intermediate task, we also introduce
Sinhala-English alignment datasets. These datasets serve as our anchor datasets
for supervised word embedding alignment. Even though we do not obtain results
comparable to the high-resource languages such as French, German, or Chinese,
we believe our work lays the groundwork for more specialized alignment between
English and Sinhala embeddings.
☆ Causal Graph in Language Model Rediscovers Cortical Hierarchy in Human Narrative Processing
Understanding how humans process natural language has long been a vital
research direction. The field of natural language processing (NLP) has recently
experienced a surge in the development of powerful language models. These
models have proven to be invaluable tools for studying another complex system
known to process human language: the brain. Previous studies have demonstrated
that the features of language models can be mapped to fMRI brain activity. This
raises the question: is there a commonality between information processing in
language models and the human brain? To estimate information flow patterns in a
language model, we examined the causal relationships between different layers.
Drawing inspiration from the workspace framework for consciousness, we
hypothesized that features integrating more information would more accurately
predict higher hierarchical brain activity. To validate this hypothesis, we
classified language model features into two categories based on causal network
measures: 'low in-degree' and 'high in-degree'. We subsequently compared the
brain prediction accuracy maps for these two groups. Our results reveal that
the difference in prediction accuracy follows a hierarchical pattern,
consistent with the cortical hierarchy map revealed by activity time constants.
This finding suggests a parallel between how language models and the human
brain process linguistic information.
comment: 15 pages, 16 figures
☆ Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads
Transformer-based pretrained large language models (PLMs) such as BERT and GPT
have achieved remarkable success in NLP tasks. However, PLMs are prone to
encoding stereotypical biases. Although a burgeoning literature has emerged on
stereotypical bias mitigation in PLMs, such as work on debiasing gender and
racial stereotyping, how such biases manifest and behave internally within PLMs
remains largely unknown. Understanding the internal stereotyping mechanisms may
allow better assessment of model fairness and guide the development of
effective mitigation strategies. In this work, we focus on attention heads, a
major component of the Transformer architecture, and propose a bias analysis
framework to explore and identify a small set of biased heads that are found to
contribute to a PLM's stereotypical bias. We conduct extensive experiments to
validate the existence of these biased heads and to better understand how they
behave. We investigate gender and racial bias in the English language in two
types of Transformer-based PLMs: the encoder-based BERT model and the
decoder-based autoregressive GPT model. Overall, the results shed light on
understanding the bias behavior in pretrained language models.
☆ FOAL: Fine-grained Contrastive Learning for Cross-domain Aspect Sentiment Triplet Extraction
Aspect Sentiment Triplet Extraction (ASTE) has achieved promising results
while relying on sufficient annotation data in a specific domain. However, it
is infeasible to annotate data for each individual domain. We propose to
explore ASTE in the cross-domain setting, which transfers knowledge from a
resource-rich source domain to a resource-poor target domain, thereby
alleviating the reliance on labeled data in the target domain. To effectively
transfer the knowledge across domains and extract the sentiment triplets
accurately, we propose a method named Fine-grained cOntrAstive Learning (FOAL)
to reduce the domain discrepancy and preserve the discriminability of each
category. Experiments on six transfer pairs show that FOAL achieves 6%
performance gains and reduces the domain discrepancy significantly compared
with strong baselines. Our code will be publicly available once accepted.
☆ Exploring the Relationship between In-Context Learning and Instruction Tuning
In-Context Learning (ICL) and Instruction Tuning (IT) are two primary
paradigms of adopting Large Language Models (LLMs) to downstream applications.
However, they are significantly different. In ICL, a set of demonstrations is
provided at inference time, but the LLM's parameters are not updated. In IT, a
set of demonstrations is used to tune the LLM's parameters at training time,
but no demonstrations are used at inference time. Although a growing body of
literature has explored ICL and IT, studies on these topics have largely been
conducted in isolation, leading to a disconnect between these two paradigms. In
this work, we explore the relationship between ICL and IT by examining how the
hidden states of LLMs change in these two paradigms. Through carefully designed
experiments conducted with LLaMA-2 (7B and 13B), we find that ICL is implicit
IT. In other words, ICL changes an LLM's hidden states as if the demonstrations
were used to instructionally tune the model. Furthermore, the convergence
between ICL and IT is largely contingent upon several factors related to the
provided demonstrations. Overall, this work offers a unique perspective to
explore the connection between ICL and IT and sheds light on understanding the
behaviors of LLMs.
☆ Complementary Advantages of ChatGPTs and Human Readers in Reasoning: Evidence from English Text Reading Comprehension
ChatGPT has shown its great power in text processing, including its reasoning
ability from text reading. However, there has not been any direct comparison
between human readers and ChatGPT in reasoning ability related to text reading.
This study was undertaken to investigate how ChatGPTs (i.e., ChatGPT and
ChatGPT Plus) and Chinese senior school students as ESL learners exhibited
their reasoning ability from English narrative texts. Additionally, we
compared the reasoning performance of the two ChatGPTs when the commands were
elaborately updated. The study comprised three reasoning tests: Test 1 for
commonsense inference, Test 2 for emotional inference, and Test 3 for causal
inference. The results showed that in Test 1, the students outdid the two
ChatGPT versions in local-culture-related inferences but performed worse than
the chatbots in daily-life inferences. In Test 2, ChatGPT Plus excelled whereas
ChatGPT lagged behind in accuracy. Considering both accuracy and the
frequency of correct responses, the students were inferior to the two chatbots.
Compared with ChatGPTs' better performance in positive emotions, the students
showed their superiority in inferring negative emotions. In Test 3, the
students demonstrated better logical analysis, outdoing both chatbots. Under
the updated-command condition, ChatGPT Plus displayed good causal reasoning
ability, while ChatGPT's performance remained unchanged. Our study reveals
that human readers and
ChatGPTs have their respective advantages and disadvantages in drawing
inferences from text reading comprehension, unlocking a complementary
relationship in text-based reasoning.
☆ Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking
Continual learning is crucial for dialog state tracking (DST) in dialog
systems, since requirements from users for new functionalities are often
encountered. However, most existing continual learning methods for DST
require task identities during testing, which is a severe limitation in
real-world applications. In this paper, we aim to address continual learning
of DST in the class-incremental scenario (namely, the task identity is unknown
at test time).
Inspired by the recently emerging prompt tuning method that performs well on
dialog systems, we propose to use the prompt pool method, where we maintain a
pool of key-value paired prompts and select prompts from the pool according to
the distance between the dialog history and the prompt keys. The proposed
method can automatically identify tasks and select appropriate prompts during
testing. We conduct experiments on the Schema-Guided Dialog dataset (SGD) and
another dataset collected from a real-world dialog application. Experiment
results show that the prompt pool method achieves much higher joint goal
accuracy than the baseline. After combining with a rehearsal buffer, the model
performance can be further improved.
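The selection step described above, matching the encoded dialog history against learned prompt keys, can be sketched with cosine similarity over a toy pool. The keys, the "encoded history" vectors, and the prompt contents below are all made-up stand-ins for what would be learned components:

```python
import math

# A toy prompt pool: each entry pairs a key vector with a prompt.
pool = {
    "restaurant": ([0.9, 0.1, 0.0], "prompt-restaurant"),
    "flight":     ([0.0, 0.8, 0.2], "prompt-flight"),
    "hotel":      ([0.1, 0.1, 0.9], "prompt-hotel"),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_prompt(history_vec, top_k=1):
    """Pick the prompts whose keys lie closest to the encoded history."""
    scored = sorted(pool.items(),
                    key=lambda kv: cosine(history_vec, kv[1][0]),
                    reverse=True)
    return [prompt for _, (_, prompt) in scored[:top_k]]

# A dialog about booking a room should land nearest the "hotel" key.
print(select_prompt([0.05, 0.2, 0.95]))   # ['prompt-hotel']
```

Because selection depends only on the history encoding, no task identity is needed at test time, which is the property the class-incremental setting requires.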
☆ Energy and Carbon Considerations of Fine-Tuning BERT EMNLP 2023
Despite the popularity of the 'pre-train then fine-tune' paradigm in the NLP
community, existing work quantifying energy costs and associated carbon
emissions has largely focused on language model pre-training. Although a single
pre-training run draws substantially more energy than fine-tuning, fine-tuning
is performed more frequently by many more individual actors, and thus must be
accounted for when considering the energy and carbon footprint of NLP. In order
to better characterize the role of fine-tuning in the landscape of energy and
carbon emissions in NLP, we perform a careful empirical study of the
computational costs of fine-tuning across tasks, datasets, hardware
infrastructure and measurement modalities. Our experimental results allow us to
place fine-tuning energy and carbon costs into perspective with respect to
pre-training and inference, and outline recommendations to NLP researchers and
practitioners who wish to improve their fine-tuning energy efficiency.
comment: EMNLP 2023 Findings; First two authors contributed equally; 12 pages
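The bookkeeping behind such measurements is simple: energy is average power draw integrated over runtime (scaled by datacenter overhead), and emissions follow from the grid's carbon intensity. A back-of-the-envelope sketch; all figures below are illustrative defaults, not numbers from the paper:

```python
def finetune_footprint(avg_power_w, hours, pue=1.1, kgco2_per_kwh=0.4):
    """Energy (kWh) and emissions (kg CO2e) for one fine-tuning run.

    pue: power usage effectiveness, the datacenter overhead multiplier.
    kgco2_per_kwh: carbon intensity of the local electricity grid.
    """
    energy_kwh = avg_power_w * hours / 1000.0 * pue
    return energy_kwh, energy_kwh * kgco2_per_kwh

# One illustrative run: a 300 W GPU fine-tuning for 2 hours.
energy, co2 = finetune_footprint(300, 2)
print(energy, co2)
```

The paper's point follows directly from this arithmetic: a single fine-tuning run is small next to pre-training, but multiplying by the number of practitioners who fine-tune makes the aggregate term significant.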
☆ Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2
The training of large language models (LLMs) on extensive, unfiltered corpora
sourced from the internet is a common and advantageous practice. Consequently,
LLMs have learned and inadvertently reproduced various types of biases,
including violent, offensive, and toxic language. However, recent research
shows that generative pretrained transformer (GPT) language models can
recognize their own biases and detect toxicity in generated content, a process
referred to as self-diagnosis. In response, researchers have developed a
decoding algorithm that allows LLMs to self-debias, or reduce their likelihood
of generating harmful text. This study investigates the efficacy of the
diagnosing-debiasing approach in mitigating two additional types of biases:
insults and political bias. These biases are often used interchangeably in
discourse, despite exhibiting potentially dissimilar semantic and syntactic
properties. We aim to contribute to the ongoing effort of investigating the
ethical and social implications of human-AI interaction.
comment: 9 pages
♻ ☆ VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for
evaluation of instruction-following vision-language models for real-world use.
Our starting point is curating 70 'instruction families' that we envision
instruction tuned vision-language models should be able to address. Extending
beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to
game playing and creative generation. Following curation, our dataset comprises
592 test queries, each with a human-authored instruction-conditioned caption.
These descriptions surface instruction-specific factors, e.g., for an
instruction asking about the accessibility of a storefront for wheelchair
users, the instruction-conditioned caption describes ramps/potential obstacles.
These descriptions enable 1) collecting human-verified reference outputs for
each instance; and 2) automatic evaluation of candidate multimodal generations
using a text-only LLM, aligning with human judgment. We quantify quality gaps
between models and references using both human and automatic evaluations; e.g.,
the top-performing instruction-following model wins against the GPT-4 reference
in just 27% of the comparisons. VisIT-Bench is designed for dynamic
participation: practitioners simply submit their model's responses on the
project website. Data, code, and the leaderboard are available at
visit-bench.github.io.
♻ ☆ InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction
Ishani Mondal, Michelle Yuan, Anandhavelu N, Aparna Garimella, Francis Ferraro, Andrew Blair-Stanek, Benjamin Van Durme, Jordan Boyd-Graber
Learning template-based information extraction from documents is a crucial
yet difficult task. Prior template-based IE approaches assume foreknowledge of
the domain templates; however, real-world IE tasks do not have pre-defined
schemas and are a figure-it-out-as-you-go phenomenon. To quickly bootstrap
templates in a real-world setting, we need to induce template slots from
documents with zero or minimal supervision. Since the purpose of question
answering intersects with
the goal of information extraction, we use automatic question generation to
induce template slots from the documents and investigate how a tiny amount of
proxy human supervision on the fly (termed InteractiveIE) can further boost
the performance. Extensive experiments on biomedical and legal documents, where
obtaining training data is expensive, reveal encouraging trends of performance
improvement using InteractiveIE over an AI-only baseline.
comment: Version 2
♻ ☆ Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search
Abstractive summarization systems today produce fluent and relevant output,
but often "hallucinate" statements not supported by the source text. We analyze
the connection between hallucinations and training data, and find evidence that
models hallucinate because they train on target summaries that are unsupported
by the source. Based on our findings, we present PINOCCHIO, a new decoding
method that improves the consistency of a transformer-based abstractive
summarizer by constraining beam search to avoid hallucinations. Given the model
states and outputs at a given step, PINOCCHIO detects likely model
hallucinations based on various measures of attribution to the source text.
PINOCCHIO backtracks to find more consistent output, and can opt to produce no
summary at all when no consistent generation can be found. In experiments, we
find that PINOCCHIO improves the consistency of generation (in terms of F1) by
an average of ~67% on two abstractive summarization datasets.
comment: 16 pages, 2 figures, 7 tables
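PINOCCHIO's decoding loop can be caricatured as: score candidates, flag likely hallucinations via attribution to the source, fall back to the next candidate, and abstain when nothing consistent remains. The sketch below is drastically simplified: membership in the source text stands in for the paper's attribution measures, and the fallback is per-step rather than true backtracking over beam states:

```python
# Toy setting: at each step the "model" proposes tokens in descending
# score order; a token counts as consistent only if it appears in the
# source text (a crude stand-in for attribution-based detection).
SOURCE = {"the", "cat", "sat", "on", "mat"}

CANDIDATES = [                  # step -> tokens in descending model score
    ["the"],
    ["dog", "cat"],             # "dog" scores higher but is unsupported
    ["sat"],
    ["on", "under"],
    ["the"],
    ["sofa", "mat", "rug"],     # top choice would be a hallucination
]

def consistent(token):
    return token in SOURCE

def decode_with_fallback(candidates):
    """Greedy decode; skip flagged tokens, and opt to produce no output
    at all (return None) when no consistent candidate exists at a step."""
    out = []
    for step in candidates:
        for tok in step:
            if consistent(tok):
                out.append(tok)
                break
        else:
            return None         # no consistent generation found
    return out

print(decode_with_fallback(CANDIDATES))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

The "produce no summary at all" behavior corresponds to the `None` return: abstaining is preferable to emitting an unsupported statement.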
♻ ☆ Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere, Felipe del Rio, Andres Carvallo De Ferari, Carlos Aspillaga, Eugenio Herrera-Berg, Cristian Buc Calderon
Artificial neural networks typically struggle in generalizing to
out-of-context examples. One reason for this limitation is that datasets
incorporate only partial information regarding the potential
correlational structure of the world. In this work, we propose TIDA (Targeted
Image-editing Data Augmentation), a targeted data augmentation method focused
on improving models' human-like abilities (e.g., gender recognition) by filling
the correlational structure gap using a text-to-image generative model. More
specifically, TIDA identifies specific skills in captions describing images
(e.g., the presence of a specific gender in the image), changes the caption
(e.g., "woman" to "man"), and then uses a text-to-image model to edit the image
in order to match the novel caption (e.g., uniquely changing a woman to a man
while maintaining the context identical). Based on the Flickr30K benchmark, we
show that, compared with the original data set, a TIDA-enhanced dataset related
to gender, color, and counting abilities induces better performance in several
image captioning metrics. Furthermore, beyond the classical BLEU metric, we
conduct a fine-grained analysis of our models' improvements over the baseline
in several ways. We also compared text-to-image generative models and found
different behaviors of the image captioning models in terms of visual encoding
and textual decoding.
♻ ☆ A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems
Existing evaluations of entity linking systems often say little about how the
system is going to perform for a particular application. There are two
fundamental reasons for this. One is that many evaluations only use aggregate
measures (like precision, recall, and F1 score), without a detailed error
analysis or a closer look at the results. The other is that all of the widely
used benchmarks have strong biases and artifacts, in particular: a strong focus
on named entities, an unclear or missing specification of what else counts as
an entity mention, poor handling of ambiguities, and an over- or
underrepresentation of certain kinds of entities.
We provide a more meaningful and fair in-depth evaluation of a variety of
existing end-to-end entity linkers. We characterize their strengths and
weaknesses and also report on reproducibility aspects. The detailed results of
our evaluation can be inspected under
https://elevant.cs.uni-freiburg.de/emnlp2023 . Our evaluation is based on
several widely used benchmarks, which exhibit the problems mentioned above to
various degrees, as well as on two new benchmarks, which address the problems
mentioned above. The new benchmarks can be found under
https://github.com/ad-freiburg/fair-entity-linking-benchmarks .
♻ ☆ Uncovering Intermediate Variables in Transformers using Circuit Probing
Neural network models have achieved high performance on a wide variety of
complex tasks, but the algorithms that they implement are notoriously difficult
to interpret. In order to understand these algorithms, it is often necessary to
hypothesize intermediate variables involved in the network's computation. For
example, does a language model depend on particular syntactic properties when
generating a sentence? However, existing analysis tools make it difficult to
test hypotheses of this type. We propose a new analysis technique -- circuit
probing -- that automatically uncovers low-level circuits that compute
hypothesized intermediate variables. This enables causal analysis through
targeted ablation at the level of model parameters. We apply this method to
models trained on simple arithmetic tasks, demonstrating its effectiveness at
(1) deciphering the algorithms that models have learned, (2) revealing modular
structure within a model, and (3) tracking the development of circuits over
training. We compare circuit probing to other methods across these three
experiments and find it to be on par with or more effective than existing
analysis methods. Finally, we demonstrate circuit probing on a real-world use case,
uncovering circuits that are responsible for subject-verb agreement and
reflexive anaphora in GPT2-Small and Medium.
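Once a circuit for a hypothesized intermediate variable has been located, the causal step the abstract describes amounts to ablating those parameters and measuring the effect on the task. A minimal sketch with a toy two-parameter "model" — `forward`, `ablation_effect`, and the circuit indexing are illustrative stand-ins, not the paper's implementation:

```python
def forward(weights, x):
    # Toy two-parameter "model": unit 0 carries x, unit 1 carries parity.
    return weights[0] * x + weights[1] * (x % 2)

def ablation_effect(weights, circuit, inputs, targets):
    """Zero the parameters in `circuit` and report the increase in mean
    squared error; a large increase supports the hypothesis that the
    circuit computes the intermediate variable of interest."""
    def mse(w):
        return sum((forward(w, x) - t) ** 2
                   for x, t in zip(inputs, targets)) / len(inputs)
    ablated = [0.0 if i in circuit else w for i, w in enumerate(weights)]
    return mse(ablated) - mse(weights)
```

Here ablating the parity unit (index 1) on parity-sensitive targets produces a large error increase, while ablating nothing produces none.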
♻ ☆ The Dark Side of the Language: Pre-trained Transformers in the DarkNet
Leonardo Ranaldi, Aria Nourbakhsh, Arianna Patrizi, Elena Sofia Ruzzetti, Dario Onorati, Francesca Fallucchi, Fabio Massimo Zanzotto
Pre-trained Transformers are challenging human performance on many NLP
tasks. The massive datasets used for pre-training seem to be the key to their
success on existing tasks. In this paper, we explore how a range of pre-trained
Natural Language Understanding models perform on genuinely unseen sentences
provided by classification tasks over a DarkNet corpus. Surprisingly, results
show that syntactic and lexical neural networks perform on par with pre-trained
Transformers even after fine-tuning. Only after what we call extreme domain
adaptation, that is, retraining with the masked language model task on the
entire novel corpus, do pre-trained Transformers reach their usual high results. This
suggests that huge pre-training corpora may give Transformers unexpected help
because they have already been exposed to many of the possible sentences.
♻ ☆ Classifying COVID-19 vaccine narratives
Vaccine hesitancy is widespread, despite government information
campaigns and the efforts of the World Health Organisation (WHO). Categorising
the topics within vaccine-related narratives is crucial to understand the
concerns expressed in discussions and identify the specific issues that
contribute to vaccine hesitancy. This paper addresses the need for monitoring
and analysing vaccine narratives online by introducing a novel vaccine
narrative classification task, which categorises COVID-19 vaccine claims into
one of seven categories. Following a data augmentation approach, we first
construct a novel dataset for this new classification task, focusing on the
minority classes. We also make use of fact-checker annotated data. The paper
also presents a neural vaccine narrative classifier that achieves an accuracy
of 84% under cross-validation. The classifier is publicly available for
researchers and journalists.
comment: In Proceedings of the 14th International Conference on Recent
Advances in Natural Language Processing, 2023
♻ ☆ DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking
Inspired by the dual-process theory of human cognition, we introduce DUMA, a
novel conversational agent framework that embodies a dual-mind mechanism
through the utilization of two generative Large Language Models (LLMs)
dedicated to fast and slow thinking respectively. The fast thinking model
serves as the primary interface for external interactions and initial response
generation, evaluating the necessity for engaging the slow thinking model based
on the complexity of the complete response. When invoked, the slow thinking
model takes over the conversation, engaging in meticulous planning, reasoning,
and tool utilization to provide a well-analyzed response. This dual-mind
configuration allows for a seamless transition between intuitive responses and
deliberate problem-solving processes based on the situation. We have
constructed a conversational agent to handle online inquiries in the real
estate industry. Experiments show that our method balances effectiveness
and efficiency and yields a significant improvement over the baseline.
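The fast/slow hand-off can be pictured as a router that consults a complexity estimate before deciding which model answers. A minimal sketch, assuming the models are plain callables and using a toy length-based heuristic — `route_query`, the threshold, and the heuristic are illustrative, not DUMA's actual gating:

```python
def route_query(query, fast_model, slow_model,
                complexity_threshold=0.5, estimate_complexity=None):
    """Route a query to the fast or the slow model.

    `estimate_complexity` stands in for whatever signal the fast-thinking
    model uses to decide whether deliberate reasoning is needed.
    """
    if estimate_complexity is None:
        # Toy heuristic: treat longer queries as more complex.
        estimate_complexity = lambda q: min(len(q.split()) / 20.0, 1.0)
    if estimate_complexity(query) < complexity_threshold:
        return fast_model(query)   # intuitive, low-latency answer
    return slow_model(query)       # deliberate planning and reasoning
```

In DUMA the decision is made by the fast model itself after drafting an initial response; the heuristic above merely fixes the control flow.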
♻ ☆ Insights Into the Nutritional Prevention of Macular Degeneration based on a Comparative Topic Modeling Approach
Topic modeling and text mining are subsets of Natural Language Processing
(NLP) with relevance for conducting meta-analysis (MA) and systematic review
(SR). For evidence synthesis, the above NLP methods are conventionally used for
topic-specific literature searches or extracting values from reports to
automate essential phases of SR and MA. Instead, this work proposes a
comparative topic modeling approach to analyze reports of contradictory results
on the same general research question. Specifically, the objective is to
identify topics exhibiting distinct associations with significant results for
an outcome of interest by ranking them according to their proportional
occurrence in (and consistency of distribution across) reports of significant
effects. The proposed method was tested on broad-scope studies addressing
whether supplemental nutritional compounds significantly benefit macular
degeneration (MD). Four of these were further supported in terms of
effectiveness upon conducting a follow-up literature search for validation
(omega-3 fatty acids, copper, zeaxanthin, and nitrates). The two not supported
by the follow-up literature search (niacin and molybdenum) also had scores in
the lowest range under the proposed scoring system, suggesting that the
proposed method's score for a given topic may be a viable proxy for its degree
of association with the outcome of interest and can be helpful in the search
for potentially causal relationships. These results underpin the proposed
method's potential to add specificity in understanding effects from broad-scope
reports, to elucidate topics of interest for future research, and to guide evidence
synthesis in a systematic and scalable way. All of this is accomplished while
yielding valuable insights into the prevention of MD.
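The ranking idea — score topics by their proportional occurrence in, and specificity to, reports of significant effects — can be sketched as follows. `topic_scores` and its frequency-times-specificity formula are an illustrative stand-in, not the paper's exact scoring system:

```python
from collections import Counter

def topic_scores(significant_docs, nonsignificant_docs):
    """Rank topics by how strongly they associate with significant results.

    Each document is represented as a set of topic labels. A topic's score
    multiplies its frequency among significant reports by how much that
    frequency exceeds its frequency among non-significant reports.
    """
    sig_counts = Counter(t for d in significant_docs for t in d)
    non_counts = Counter(t for d in nonsignificant_docs for t in d)
    n_sig, n_non = len(significant_docs), len(nonsignificant_docs)
    scores = {}
    for topic in set(sig_counts) | set(non_counts):
        p_sig = sig_counts[topic] / n_sig
        p_non = non_counts[topic] / n_non if n_non else 0.0
        scores[topic] = p_sig * (p_sig - p_non)  # frequency x specificity
    return scores
```

A topic that appears in every significant report and no null report scores 1.0; topics more common in null reports score negatively.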
♻ ☆ Who Wrote this Code? Watermarking for Code Generation
With the remarkable generation performance of large language models, ethical
and legal concerns about using them have been raised, such as plagiarism and
copyright issues. For such concerns, several approaches to watermark and detect
LLM-generated text have been proposed very recently. However, we discover that
the previous methods fail to function appropriately with code generation tasks
because of the syntactic and semantic characteristics of code. Based on
\citet{Kirchenbauer2023watermark}, we propose a new watermarking method,
Selective WatErmarking via Entropy Thresholding (SWEET), that promotes "green"
tokens only at positions where the entropy of the token distribution is high
during generation, thereby preserving the correctness of the generated code. The
watermarked code is detected via a statistical test and a Z-score based on the
same entropy information. Our experiments on HumanEval and MBPP show that SWEET
significantly improves the Pareto Frontier between the code correctness and
watermark detection performance. We also show that notable post-hoc detection
methods (e.g., DetectGPT) fail to work well on this task. Finally, we show that
setting a reasonable entropy threshold is not much of a challenge. Code is
available at https://github.com/hongcheki/sweet-watermark.
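A rough sketch of the selective scheme: partition the vocabulary into a pseudo-random "green list" seeded by the previous token (following the Kirchenbauer et al. scheme SWEET builds on), but count green tokens at detection time only at positions whose predicted distribution has entropy above a threshold. The helper names, the toy vocabulary, and the threshold value are assumptions for illustration, not the authors' code:

```python
import hashlib
import math
import random

def entropy(probs):
    """Shannon entropy of a token distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def green_list(prev_token, vocab_size, gamma=0.5):
    """Pseudo-random vocabulary partition seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def detect_z_score(tokens, probs_per_step, tau=1.0, gamma=0.5):
    """Count green tokens only at positions whose predicted distribution
    has entropy above tau, then z-test against unwatermarked text."""
    hits = total = 0
    for i in range(1, len(tokens)):
        if entropy(probs_per_step[i]) > tau:
            total += 1
            vocab_size = len(probs_per_step[i])
            if tokens[i] in green_list(tokens[i - 1], vocab_size, gamma):
                hits += 1
    if total == 0:
        return 0.0
    return (hits - gamma * total) / math.sqrt(total * gamma * (1 - gamma))
```

Low-entropy positions (e.g., a forced keyword after `def`) are skipped entirely, which is what preserves code correctness during generation.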
♻ ☆ PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models
Junlei Zhang, Hongliang He, Nirui Song, Shuyuan He, Shuai Zhang, Huachuan Qiu, Anqi Li, Lizhi Ma, Zhenzhong Lan
As Large Language Models (LLMs) become prevalent in various fields,
there is an urgent need for improved NLP benchmarks that encompass all the
necessary knowledge of individual disciplines. Many contemporary benchmarks for
foundational models emphasize a broad range of subjects but often fall short in
covering all the critical subjects and the professional knowledge they
require. This shortfall has led to skewed results, given that LLMs
exhibit varying performance across different subjects and knowledge areas. To
address this issue, we present psybench, the first comprehensive Chinese
evaluation suite that covers all the necessary knowledge required for graduate
entrance exams. psybench offers a deep evaluation of a model's strengths and
weaknesses in psychology through multiple-choice questions. Our findings show
significant differences in performance across different sections of a subject,
highlighting the risk of skewed results when the knowledge in test sets is not
balanced. Notably, only the ChatGPT model reaches an average accuracy above
$70\%$, indicating that there is still plenty of room for improvement. We
expect that psybench will help to conduct thorough evaluations of base models'
strengths and weaknesses and assist in practical application in the field of
psychology.
♻ ☆ CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NeurIPS 2023
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang
Code completion models have made significant progress in recent years, yet
current popular evaluation datasets, such as HumanEval and MBPP, predominantly
focus on code completion tasks within a single file. This over-simplified
setting falls short of representing the real-world software development
scenario where repositories span multiple files with numerous cross-file
dependencies, and accessing and understanding cross-file context is often
required to complete the code correctly.
To fill in this gap, we propose CrossCodeEval, a diverse and multilingual
code completion benchmark that necessitates an in-depth cross-file contextual
understanding to complete the code accurately. CrossCodeEval is built on a
diverse set of real-world, open-sourced, permissively-licensed repositories in
four popular programming languages: Python, Java, TypeScript, and C#. To create
examples that strictly require cross-file context for accurate completion, we
propose a straightforward yet efficient static-analysis-based approach to
pinpoint the use of cross-file context within the current file.
Extensive experiments on state-of-the-art code language models like CodeGen
and StarCoder demonstrate that CrossCodeEval is extremely challenging when the
relevant cross-file context is absent, and we see clear improvements when
adding this context to the prompt. However, despite such improvements, even
the highest-performing model remains far from the performance ceiling,
indicating that CrossCodeEval can also assess a model's capability to
leverage extensive context for better
code completion. Finally, we benchmarked various methods for retrieving
cross-file context, and show that CrossCodeEval can also be used to measure the
capability of code retrievers.
comment: To appear at NeurIPS 2023 (Datasets and Benchmarks Track)
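The static-analysis step — pinpointing where the current file uses names defined in other files — can be approximated for Python with the standard `ast` module. `cross_file_usages` is a simplified illustration, not the benchmark's actual multi-language pipeline:

```python
import ast

def cross_file_usages(source, local_modules):
    """Find uses of names imported from other files in the same repository.

    Returns (line, name, module) triples for every reference to a symbol
    imported from a module listed in `local_modules`.
    """
    tree = ast.parse(source)
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module in local_modules:
            for alias in node.names:
                imported[alias.asname or alias.name] = node.module
    usages = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in imported:
            usages.append((node.lineno, node.id, imported[node.id]))
    return usages
```

Lines flagged this way are the ones that strictly require cross-file context to complete correctly, which is the property the benchmark's examples are selected for.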
♻ ☆ Hierarchical Catalogue Generation for Literature Review: A Benchmark EMNLP 2023
Scientific literature review generation aims to extract and organize
important information from an abundant collection of reference papers and to
produce corresponding reviews, which often lack a clear and logical hierarchy. We
observe that a high-quality catalogue-guided generation process can effectively
alleviate this problem. Therefore, we present an atomic and challenging task
named Hierarchical Catalogue Generation for Literature Review as the first step
for review generation, which aims to produce a hierarchical catalogue of a
review paper given various references. We construct a novel English
Hierarchical Catalogues of Literature Reviews Dataset with 7.6k literature
review catalogues and 389k reference papers. To accurately assess the model
performance, we design two evaluation metrics for informativeness and
similarity to ground truth, from the perspectives of semantics and structure. Our extensive analyses
verify the high quality of our dataset and the effectiveness of our evaluation
metrics. We further benchmark diverse experiments on state-of-the-art
summarization models like BART and large language models like ChatGPT to
evaluate their capabilities. We further discuss potential directions for this
task to motivate future research.
comment: EMNLP 2023 findings
♻ ☆ GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors
Traditional Korean medicine (TKM) emphasizes individualized diagnosis and
treatment. This uniqueness makes AI modeling difficult due to limited data and
implicit processes. Large language models (LLMs) have demonstrated impressive
medical inference, even without advanced training in medical texts. This study
assessed the capabilities of GPT-4 in TKM, using the Korean National Licensing
Examination for Korean Medicine Doctors (K-NLEKMD) as a benchmark. The
K-NLEKMD, administered by a national organization, encompasses 12 major
subjects in TKM. We optimized prompts with Chinese-term annotation, English
translation for questions and instruction, exam-optimized instruction, and
self-consistency. GPT-4 with optimized prompts achieved 66.18% accuracy,
surpassing both the examination's average pass mark of 60% and the 40% minimum
for each subject. The gradual introduction of language-related prompts and
prompting techniques enhanced the accuracy from 51.82% to its maximum accuracy.
GPT-4 showed low accuracy in subjects that are localized to Korea and TKM,
such as public health & medicine-related law and internal medicine (2).
The model's accuracy was lower for questions requiring TKM-specialized
knowledge. It exhibited higher accuracy in diagnosis-based and recall-based
questions than in intervention-based questions. A positive correlation was
observed between the consistency and accuracy of GPT-4's responses. This study
unveils both the potential and challenges of applying LLMs to TKM. These
findings underline the potential of LLMs like GPT-4 in culturally adapted
medicine, especially TKM, for tasks such as clinical assistance, medical
education, and research. They also point to the need to develop methods
that mitigate the cultural bias inherent in large language models and to
validate their efficacy in real-world clinical settings.
comment: 23 pages, 4 figures
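Of the prompting techniques listed, self-consistency is the most mechanical: sample several answers and take a majority vote, with the vote share serving as the consistency signal the study correlates with accuracy. A minimal sketch, where `self_consistency` and the callable interface are illustrative assumptions:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Majority vote over several sampled answers.

    `sample_answer` stands in for one stochastic model call that returns
    a final answer (e.g., a multiple-choice option); the vote share is
    returned as a consistency score alongside the winning answer.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    best, freq = Counter(answers).most_common(1)[0]
    return best, freq / n_samples
```

A low consistency score flags questions where the model's answer is unreliable, matching the positive correlation between consistency and accuracy reported above.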
♻ ☆ Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
We propose the Data Contamination Quiz, a simple and effective approach to
detect data contamination in large language models (LLMs) and estimate its
extent. Specifically, we frame data contamination detection as a series
of multiple-choice questions. We devise a quiz format wherein three perturbed
versions of each dataset instance are created. These changes include only
word-level perturbations that replace words with their contextual synonyms,
ensuring that both the semantics and the sentence structure remain the same as
in the original instance. Together with the original instance, these perturbed
versions constitute the choices in the quiz. Given that the only distinguishing
signal among these choices is the exact wording, an LLM, when tasked with
identifying the original instance from the choices, opts for the original if it
has memorized it in its pre-training phase--a trait intrinsic to LLMs. A
dataset partition is then marked as contaminated if the LLM's performance on
the quiz surpasses what random chance suggests. Our evaluation spans seven
datasets and their respective splits (train and test/validation) on two
state-of-the-art LLMs: GPT-4 and GPT-3.5. While lacking access to the
pre-training data, our results suggest that our approach not only enhances the
detection of data contamination but also provides an accurate estimation of its
extent, even when the contamination signal is weak.
comment: v1.1 preprint
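The quiz construction can be sketched end to end: perturb each instance at the word level, mix the original among the perturbed options, and measure how often the model picks the original. All names here (`synonym_perturb`, `contamination_quiz`, the toy synonym table) are illustrative; in the paper the synonyms are contextual and the chooser is an LLM:

```python
import random

def synonym_perturb(text, synonyms, rng):
    """Word-level perturbation: swap words for listed synonyms only."""
    return " ".join(rng.choice(synonyms[w]) if w in synonyms else w
                    for w in text.split())

def contamination_quiz(instances, synonyms, pick_original,
                       n_options=4, seed=0):
    """Present each original among perturbed versions and ask the model
    (`pick_original`, returning an option index) to identify it. A hit
    rate well above 1/n_options suggests memorization."""
    rng = random.Random(seed)
    hits = 0
    for text in instances:
        options = [synonym_perturb(text, synonyms, rng)
                   for _ in range(n_options - 1)] + [text]
        rng.shuffle(options)
        if options[pick_original(options)] == text:
            hits += 1
    return hits / len(instances)
```

A model that has memorized the instances identifies every original (hit rate 1.0), while an uncontaminated model should hover near the 1/n_options chance level.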